Adapting a search classifier based on user queries

ABSTRACT

Multiple different user queries are applied to an automated classifier to identify multiple tasks. For each query, a task is provided to a user. A task selected by the user is logged and a mapping between each query and each selected task is generated. Fewer than all of the mappings are used to train a new classifier, wherein selecting fewer than all of the mappings to train the new classifier comprises selecting mappings based on when the mappings were generated. The new classifier is stored on a computer-readable storage medium.

REFERENCE TO RELATED APPLICATIONS

This application is a divisional of and claims priority from U.S. patent application Ser. No. 10/310,408, filed on Dec. 5, 2002 and entitled METHOD AND APPARATUS FOR ADAPTING A SEARCH CLASSIFIER BASED ON USER QUERIES.

BACKGROUND OF THE INVENTION

The present invention relates to text classifiers. In particular, the present invention relates to the classification of user queries.

In the past, search tools have been developed that classify user queries to identify one or more tasks or topics that the user is interested in. In some systems, this was done with simple key-word matching in which each key word was assigned to a particular topic. In other systems, more sophisticated classifiers have been used that use the entire query to make a determination of the most likely topic or task that the user may be interested in. Examples of such classifiers include support vector machines that provide a binary classification relative to each of a set of tasks. Thus, for each task, the support vector machine is able to decide whether the query belongs to the task or not.

Such sophisticated classifiers are trained using a set of queries that have been classified by a librarian. Based on the queries and the classification given by the librarian, the support vector machine generates a hyper-boundary between those queries that match to the task and those queries that do not match to the task. Later, when a query is applied to the support vector machine for a particular task, the distance between the query and the hyper-boundary determines the confidence level with which the support vector machine is able to identify the query as either belonging to the task or not belonging to the task.

Although the training data provided by the librarian is essential to initially training the support vector machine, such training data limits the performance of the support vector machine over time. In particular, training data that includes current-events queries becomes dated over time and results in unwanted topics or tasks being returned to the user. Although additional librarian-created training data can be added over time to keep the support vector machines current, such maintenance of the support vector machines is time consuming and expensive. As such, a system is needed for updating search classifiers that requires less human intervention, while still maintaining a high standard of precision and recall.

SUMMARY OF THE INVENTION

Multiple different user queries are applied to an automated classifier to identify multiple tasks. For each query, a task is provided to a user. A task selected by the user is logged and a mapping between each query and each selected task is generated. Fewer than all of the mappings are used to train a new classifier, wherein selecting fewer than all of the mappings to train the new classifier comprises selecting mappings based on when the mappings were generated. The new classifier is stored on a computer-readable storage medium.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computing device on which a user may enter a query under the present invention.

FIG. 2 is a block diagram of a client-server architecture under one embodiment of the present invention.

FIG. 3 is a flow diagram of a method of logging search queries and selected tasks under embodiments of the present invention.

FIG. 4 is a display showing a list of tasks provided to the user in response to their query.

FIG. 5 is a flow diagram of a system for training a classifier using logged search queries under embodiments of the present invention.

FIG. 6 is a display showing an interface for designating the training data to be used in building a classifier under one embodiment of the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

The present invention may be practiced within a single computing device or in a client-server architecture in which the client and server communicate through a network. FIG. 1 provides a block diagram of a single computing device on which the present invention may be practiced or which may be operated as the client in a client-server architecture.

The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.

A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.

The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

FIG. 2 provides a block diagram of a client-server architecture under one embodiment of the present invention. In FIG. 2, a user 200 enters a query using a client computing device 202. Client 202 communicates the query through a network 206 to a search classifier 204, which uses a set of classifier models stored in model storage 208 to classify the user query. Under one embodiment, the classifier models are support vector machines.

As shown in the flow diagram of FIG. 3, when search classifier 204 receives a search query at step 300, it identifies a set of tasks that may be represented by the query and returns those identified tasks to the user at step 302. In embodiments in which support vector machines are used, the query is applied to a separate support vector machine for each task, and each separate support vector machine determines whether the query is likely related to a particular task and the confidence level of that determination. This confidence level is typically determined from the distance between a vector representing the query and a hyper-boundary defined within the support vector machine.
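
By way of illustration only, the per-task classification described above might be sketched as follows. The sketch assumes a linear hyper-boundary for each task, and the names `signed_distance`, `confidence_from_distance` and `classify`, the logistic squashing of distance into a confidence value, and the example weights are all hypothetical; the specification does not give a formula for converting distance into confidence.

```python
import math

def signed_distance(weights, bias, features):
    """Signed distance from a feature vector to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score / norm

def confidence_from_distance(distance, scale=1.0):
    """Assumed mapping of distance into (0, 1); not taken from the source."""
    return 1.0 / (1.0 + math.exp(-scale * distance))

def classify(query_vector, task_models):
    """Apply the query to one binary model per task; rank the matching tasks."""
    results = []
    for task, (weights, bias) in task_models.items():
        d = signed_distance(weights, bias, query_vector)
        if d > 0:  # the query falls on the "belongs to the task" side
            results.append((task, confidence_from_distance(d)))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Hypothetical models: task name -> (weights, bias)
task_models = {
    "book_flight": ([3.0, 4.0], -1.0),
    "find_hotel": ([0.0, 1.0], -2.0),
}
ranked = classify([1.0, 1.0], task_models)
```

In this sketch a larger positive distance yields a higher confidence, which matches the relationship described above without committing to any particular calibration.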

At step 304 of FIG. 3, search classifier 204 logs the query and the list of tasks returned to the client 202 in a log 210. Typically, this log entry includes a session ID that uniquely but abstractly identifies a client 202 such that further communications from the same client will share the same session ID. In most embodiments, the session ID is not able to identify a particular user.

In step 306 of FIG. 3, client 202 displays the returned tasks to the user so that the user may select one or more of the tasks. An example of such a display is shown in FIG. 4 where tasks 400, 402, and 404 are shown displayed near a text edit box 408 containing the user's original query. Note that in some embodiments, the query is simultaneously applied to a search engine, which provides a set of results 410 that is displayed next to the identified tasks.

At step 308 of FIG. 3, if a user does not select a task, the process returns to step 300 where the search classifier waits for a new query to be submitted by one or more users. If a user does select a task at step 308, search classifier 204 logs the selected task at step 310. After the selected task has been logged at step 310, the process returns to a loop between steps 308 and 300 wherein the search classifier waits for one or more users to select a task previously returned to the user and/or waits for a new query from a user.

Over time, log 210 grows in size to include log entries from many users over many different search sessions. After a period of time, typically a week, log 210 is used to build a new classifier as shown in the steps of FIG. 5.

At step 500 of FIG. 5, a log parser 212 accesses log 210 and parses the log to find entries in which a task was returned to a user and a subsequent entry in which a task was selected by the user. Note that the user is able to select more than one task and as such there may be multiple entries for different selected tasks based on a single query. A selected task is identified by matching the task to a task returned in an earlier log entry for the same session ID.
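
By way of illustration only, the matching performed by the log parser might be sketched as follows. The log entry layout (dictionaries with `type`, `session_id`, `query`, `tasks` and `task` fields) and the function name `extract_mappings` are assumptions made for the sketch; the specification does not define a log format.

```python
def extract_mappings(log_entries):
    """Pair each 'selected' entry with an earlier 'returned' entry of the same session."""
    returned = {}   # session_id -> list of (query, tasks), in log order
    mappings = []
    for entry in log_entries:
        sid = entry["session_id"]
        if entry["type"] == "returned":
            returned.setdefault(sid, []).append((entry["query"], entry["tasks"]))
        elif entry["type"] == "selected":
            # Assumption: check the most recently returned lists first.
            for query, tasks in reversed(returned.get(sid, [])):
                if entry["task"] in tasks:
                    mappings.append((query, entry["task"]))
                    break
    return mappings

log = [
    {"type": "returned", "session_id": "s1", "query": "cheap flights",
     "tasks": ["book_flight", "find_hotel"]},
    {"type": "returned", "session_id": "s2", "query": "weather",
     "tasks": ["forecast"]},
    {"type": "selected", "session_id": "s1", "task": "book_flight"},
    {"type": "selected", "session_id": "s1", "task": "find_hotel"},
    {"type": "selected", "session_id": "s2", "task": "book_flight"},  # never offered to s2
]
mappings = extract_mappings(log)
```

Here the single query from session `s1` yields two mappings because two tasks were selected, while the selection in session `s2` is ignored because that task was never returned in that session.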

At step 502, log parser 212 applies each query that resulted in a selected task to the classifier model stored in storage 208 to determine the confidence level of the task selected by the user. The query, task and confidence level are then stored in a database 214.

The query and selected task represent an unsupervised query-to-task mapping. This mapping is unsupervised because it is generated automatically without any supervision as to whether the selected task is appropriate for the query.

Under one embodiment, query-to-task mappings stored in database 214 are stored with a confidence bucket indicator that indicates the general confidence level of the query-to-task mapping. In particular, a separate bucket is provided for each of the following ranges of confidence levels: 50-60%, 60-70%, 70-80%, 80-90% and 90-100%. These confidence buckets are shown as buckets 216, 218, 220, 222 and 224 in FIG. 2. The step of assigning query-to-task mappings to buckets is shown as step 504 in FIG. 5.
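
By way of illustration only, the bucket assignment of step 504 might be sketched as follows. The treatment of boundary values (lower bound inclusive, 100% placed in the last bucket) and the handling of confidence levels below 50% (no bucket) are assumptions; the specification lists only the five ranges.

```python
BUCKETS = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

def bucket_for(confidence_pct):
    """Return the (low, high) range holding the confidence, or None if below 50%."""
    for low, high in BUCKETS:
        # Assumption: lower bound inclusive; exactly 100% goes in the last bucket.
        if low <= confidence_pct < high or (high == 100 and confidence_pct == 100):
            return (low, high)
    return None

examples = {pct: bucket_for(pct) for pct in (42, 55, 60, 89.9, 100)}
```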

Using a build interface 230, a build manager 232 selects a combination of training data at step 506. FIG. 6 provides an example of a build interface used by a build manager to designate the training data to be used in building a candidate classifier.

Under the embodiment of FIG. 6, the training data is designated on a per task basis. As such, a task selection box 650 is provided in which the build manager can designate a task. Note that in other embodiments, this task designation is not used and a single designation of the training data is applied to all of the tasks.

In FIG. 6, check boxes 600, 602, 604, 606, 608 and 610 correspond to portions of the original training data that were formed by a librarian and used to construct the original classifier. These original sets of training data are shown as original librarian data 233 in FIG. 2. Check box 612 allows the build manager 232 to designate a set of query-to-task mappings that have appeared multiple times in the log. Such multiple mappings are designated by log parser 212 as being duplicates 234.

Check box 614 allows build manager 232 to select training data that has been newly created by a librarian. In other words, a librarian has associated a task with a query and that mapping has been stored as new manual training data 236 in FIG. 2. Check boxes 616, 618, 620, 622 and 624 allow build manager 232 to select the training data that has been assigned to the buckets associated with 50-60%, 60-70%, 70-80%, 80-90% and 90-100% confidence levels, respectively.

Under one embodiment, build interface 230 uses the selections made in the check boxes of FIG. 6 to construct a vector representing the information contained in the check boxes. Under this embodiment, each bit position in the vector represents a single check box in FIG. 6, and the bit position has a one when the check box has been selected and a zero when the check box has not been selected. This vector is passed to a build script 238 so that the build script knows which training data has been selected by the build manager.
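
By way of illustration only, the selection vector might be sketched as follows. The ordering of bit positions (ascending check box reference numeral) and the function names are assumptions; the specification states only that each bit position corresponds to one check box.

```python
# Check box reference numerals from FIG. 6; the bit ordering is an assumption.
CHECK_BOXES = [600, 602, 604, 606, 608, 610, 612, 614, 616, 618, 620, 622, 624]

def encode_selection(selected):
    """One bit per check box: 1 if the box is selected, 0 otherwise."""
    return [1 if box in selected else 0 for box in CHECK_BOXES]

def decode_selection(bits):
    """Recover the set of selected check boxes, as a build script might."""
    return {box for box, bit in zip(CHECK_BOXES, bits) if bit}

bits = encode_selection({600, 612, 624})
```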

Build interface 230 also includes a freshness box 652, which allows the build manager to designate the percent of the training data that is to be used in constructing the classifier. This percentage represents the latest x percent of the training data that was stored in the log. For example, if the percentage is set at twenty percent, the latest 20 percent of task mappings that are found in the database are used to construct the classifier. Thus, the freshness box allows the build manager to select the training data based on when the mappings were produced.
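
By way of illustration only, selecting the latest x percent of the mappings might be sketched as follows. The `timestamp` field, the function name `select_freshest`, and the choice to round the count up are assumptions made for the sketch.

```python
import math

def select_freshest(mappings, percent):
    """Keep the latest `percent` of mappings, ordered by when they were generated."""
    ordered = sorted(mappings, key=lambda m: m["timestamp"])
    keep = math.ceil(len(ordered) * percent / 100)  # assumption: round up
    return ordered[len(ordered) - keep:]

# Ten hypothetical mappings with timestamps 0 through 9.
mappings = [{"query": f"q{i}", "task": "t", "timestamp": i} for i in range(10)]
freshest = select_freshest(mappings, 20)
```

With a freshness setting of twenty percent, only the two most recently generated of the ten mappings are retained.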

Freshness box 652 allows the build manager to tailor how much old training data will be used to construct the classifier. In addition, in embodiments where the training data is specified on a per task basis using task selection box 650, it is possible to set different freshness levels for different tasks. This is helpful because some tasks are highly time-specific and their queries change significantly over time, making it desirable to use only the latest training data. Other tasks are not time-specific and their queries change little over time. For these tasks it is desirable to use as much training data as possible to improve the performance of the classifier.
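
By way of illustration only, per-task freshness levels might be sketched as follows. The field names, the function name `select_per_task`, the default of 100 percent for tasks with no explicit setting, and the task names are assumptions.

```python
import math
from collections import defaultdict

def select_per_task(mappings, freshness_by_task, default_percent=100):
    """Apply a separate freshness percentage to the mappings of each task."""
    by_task = defaultdict(list)
    for m in mappings:
        by_task[m["task"]].append(m)
    selected = []
    for task, items in by_task.items():
        items.sort(key=lambda m: m["timestamp"])
        percent = freshness_by_task.get(task, default_percent)
        keep = math.ceil(len(items) * percent / 100)  # assumption: round up
        selected.extend(items[len(items) - keep:])
    return selected

mappings = (
    [{"query": f"n{i}", "task": "news", "timestamp": i} for i in range(1, 5)]
    + [{"query": f"h{i}", "task": "howto", "timestamp": i} for i in range(1, 3)]
)
# Time-specific task keeps only its latest half; the other task keeps everything.
selected = select_per_task(mappings, {"news": 50})
```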

Based on the check boxes selected in build interface 230, build script 238 retrieves the query-to-task mappings with the appropriate designations 216, 218, 220, 222, 224, 233, 234 and/or 236 and uses those query-to-task mappings to build a candidate classifier 240 at step 508.

Candidate classifier 240 is provided to a tester 242, which at step 510 of FIG. 5 measures the precision, recall and FeelGood performance of candidate classifier 240. Precision provides a measure of the classifier's ability to return only those tasks that are truly related to a query and not other unrelated tasks. Recall performance provides a measure of the candidate classifier's ability to return all of the tasks that are associated with a particular query. “FeelGood” is a metric that indicates, for a given known test query, whether the associated mapped task would appear as one of the top 4 tasks returned to an end user. If yes, the mapping is scored a value of 1.0. If no, the mapping is scored a value of 0.0. Averaging this value over the entire testing set produces a value between zero and one. For well-selected training sets this average is around 0.85 (85%), meaning that 85 queries out of 100 caused the proper task to appear in the top 4.
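
By way of illustration only, the FeelGood metric might be computed as sketched below. The function name `feel_good`, the assumption that the classifier returns tasks ranked best first, and the canned stand-in classifier are hypothetical.

```python
def feel_good(test_set, classify, top_n=4):
    """Average of 1.0/0.0 scores: is the mapped task among the top_n returned tasks?"""
    hits = 0.0
    for query, expected_task in test_set:
        ranked_tasks = classify(query)  # assumed to be ordered best first
        if expected_task in ranked_tasks[:top_n]:
            hits += 1.0
    return hits / len(test_set)

# Stand-in classifier (hypothetical): fixed ranked lists per query.
canned = {
    "q1": ["t1", "t2", "t3", "t4", "t5"],
    "q2": ["t2", "t1"],
    "q3": ["t9", "t8", "t7", "t6", "t3"],  # t3 is ranked fifth, outside the top 4
    "q4": ["t4"],
}
test_set = [("q1", "t4"), ("q2", "t1"), ("q3", "t3"), ("q4", "t4")]
score = feel_good(test_set, lambda q: canned[q])
```

Three of the four test queries place their mapped task in the top four, so the sketch yields 0.75.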

Under one embodiment, the step of testing the candidate classifier at step 510 is performed using a “holdout” methodology. Under this method, the selected training data is divided into N sets. One of the sets is selected and the remaining sets are used to construct a candidate classifier. The set of training data that was not used to build the classifier is then applied to the classifier to determine its precision, recall and FeelGood performance. This is repeated for each set of data such that a separate classifier is built for each set of data that is held out. The performance of the candidate classifier is then determined as the average precision, recall, and FeelGood performance of each of the candidate classifiers generated for the training data.
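
By way of illustration only, the holdout methodology might be sketched as follows. For brevity the sketch averages a single score rather than separate precision, recall and FeelGood values, and the interleaved split into N sets, the majority-task stand-in for training, and all function names are assumptions.

```python
from collections import Counter

def holdout_evaluate(data, n_sets, train, evaluate):
    """Hold out each of n_sets in turn, train on the rest, and average the scores."""
    folds = [data[i::n_sets] for i in range(n_sets)]  # assumption: interleaved split
    scores = []
    for i, held_out in enumerate(folds):
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        scores.append(evaluate(train(training), held_out))
    return sum(scores) / len(scores)

def train_majority(mappings):
    """Stand-in trainer (hypothetical): always predict the most common task."""
    return Counter(task for _, task in mappings).most_common(1)[0][0]

def accuracy(model, mappings):
    """Stand-in score: fraction of held-out mappings whose task matches the model."""
    return sum(1 for _, task in mappings if task == model) / len(mappings)

data = [("q0", "a"), ("q1", "a"), ("q2", "a"), ("q3", "a"), ("q4", "b"), ("q5", "a")]
average = holdout_evaluate(data, 3, train_majority, accuracy)
```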

At step 512, the build interface 230 is provided to build manager 232 once again so that the build manager may change the combination of training data used to construct the candidate classifier. If the build manager selects a new combination of training data, the process returns to step 506 and a new candidate classifier is constructed and tested.

When the build manager has tested all of the desired combinations of training data, the best candidate classifier is selected at step 514. The performance of this best candidate is then compared to the performance of the current classifier at step 516. If the performance of the current classifier is better than the performance of the candidate classifier, the current classifier is kept in place at step 518. If, however, the candidate classifier performs better than the current classifier, the candidate classifier is designated as a release candidate 243 and is provided to a rebuild tool 244. At step 520, rebuild tool 244 replaces the current classifier with release candidate 243 in model storage 208. In many embodiments, the changing of the classifier stored in model storage 208 is performed during non-peak times. When the search classifier is operated over multiple servers, the change in classifiers is performed in a step-wise fashion across each of the servers.
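
By way of illustration only, the selection and comparison of steps 514 through 518 might be sketched as follows. The sketch assumes that performance can be summarized as a single score per classifier and that a tie leaves the current classifier in place; neither assumption is stated in the specification, and the names and score values are hypothetical.

```python
def choose_classifier(current, candidates, score):
    """Pick the best candidate; release it only if it beats the current classifier."""
    best = max(candidates, key=score)
    # Assumption: on a tie, the current classifier is kept in place.
    return best if score(best) > score(current) else current

# Hypothetical scores standing in for measured performance.
scores = {"current": 0.82, "candidate_a": 0.79, "candidate_b": 0.86}
chosen = choose_classifier("current", ["candidate_a", "candidate_b"], scores.get)
```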

Thus, the present invention provides a method by which a search classifier may be updated using query-to-task mappings that have been designated by the user as being useful. As a result, the classifier improves in performance and is able to change over time with new queries such that it is no longer limited by the original training data used during the initial construction of the search classifier. Consequently, less manually entered training data is needed under the present invention in order to update and expand the performance of the classifier.

While the present invention has been described with reference to queries and tasks, those skilled in the art will recognize that a query is simply one type of example that can be used by an example-based categorizer such as the one described above and a task is just one example of a category. Any type of example and any type of category may be used with the present invention.

Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

1. A computer-readable storage medium having computer-executable instructions for performing steps comprising: applying multiple different user queries to an automated classifier to identify multiple tasks, each user query comprising at least one word; for each user query: providing a task identified for the user query to a user; logging a task selected by the user; generating a mapping between each query and each selected task; selecting fewer than all of the mappings to train a new classifier, wherein selecting fewer than all of the mappings to train the new classifier comprises selecting mappings based on when the mappings were generated; and storing the new classifier on a computer-readable storage medium, the new classifier for identifying at least one task from a user query.
2. The computer-readable storage medium of claim 1 further comprising using a first set of mappings to train a first new classifier and a second set of mappings, different from the first set of mappings, to train a second new classifier.
3. The computer-readable storage medium of claim 2 further comprising testing the first new classifier and the second new classifier to determine which performs better.
4. The computer-readable storage medium of claim 1 wherein training a classifier comprises setting different training parameters for different tasks.
5. The computer-readable storage medium of claim 4 wherein setting a training parameter for a first task comprises selecting a first percentage of mappings produced for the first task, and setting a training parameter for a second task comprises selecting a second percentage of mappings produced for the second task, the first percentage being different from the second percentage.
6. A method comprising: applying multiple different user queries to an automated classifier to identify multiple tasks; for each query, providing a task identified for the query to a user; for at least two queries, logging a task selected by the user; generating a mapping between each query for which a task was selected and each selected task; selecting fewer than all of the mappings to train a new classifier by selecting mappings based on when the mappings were generated; and storing the new classifier on a computer-readable storage medium, the new classifier for identifying at least one task from a user query.
7. The method of claim 6 further comprising using a first set of mappings to train a first new classifier and a second set of mappings, different from the first set of mappings, to train a second new classifier.
8. The method of claim 7 further comprising testing the first new classifier and the second new classifier to determine which performs better.
9. The method of claim 6 wherein training a classifier comprises setting different training parameters for different tasks.
10. The method of claim 9 wherein setting a training parameter for a first task comprises selecting a first percentage of mappings produced for the first task, and setting a training parameter for a second task comprises selecting a second percentage of mappings produced for the second task, the first percentage being different from the second percentage.
11. A method comprising: receiving input designating a first percentage of mappings between a first task and a first set of queries that is to be used to train a classifier, the first percentage less than one-hundred percent; receiving input designating a second percentage of mappings between a second task and a second set of queries that is to be used to train the classifier, the second percentage less than one-hundred percent; retrieving the first percentage of mappings between the first task and the first set of queries by selecting the latest formed mappings between the first task and the first set of queries up to the first percentage; retrieving the second percentage of mappings between the second task and the second set of queries by selecting the latest formed mappings between the second task and the second set of queries up to the second percentage; using the retrieved mappings to train a classifier for classifying a query into at least one task; and storing the classifier on a computer-readable storage medium.
12. The method of claim 11 further comprising forming mappings between the first task and the first set of queries through steps comprising: receiving a query from a user; identifying a task for the query and displaying the task to the user; logging a task selected by the user and the query; and using the logged task and the query to form the mappings.